Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

Nanopore long-read sequencing is an emerging approach for studying genomes, including long repetitive elements like telomeres. Here, we report extensive basecalling-induced errors at telomere repeats across nanopore datasets, sequencing platforms, basecallers, and basecalling models. We find that telomeres in many organisms are frequently miscalled. We demonstrate that tuning of nanopore basecalling models leads to improved recovery and analysis of telomeric regions, with minimal negative impact on other genomic regions. We highlight the importance of verifying nanopore basecalls in long, repetitive, and poorly defined regions, and showcase how artefacts can be resolved by improvements in nanopore basecalling models.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02751-6.

(SMRT) sequencing and nanopore sequencing, have been developed to generate sequence reads of over 10 kilobases from DNA molecules [12,13]. In SMRT sequencing, the incorporation of DNA nucleotides is captured in real time via one of four different fluorescent dyes attached to each of the four DNA bases, thereby allowing the corresponding DNA sequence to be inferred. Sequencing the same DNA molecule multiple times in a circular manner further allows a highly accurate consensus sequence of the DNA molecule to be generated, in a process termed Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing [12]. During nanopore sequencing, the ionic current, which varies according to the DNA sequence, is measured while a single-stranded DNA molecule passes through a nanopore channel. The electrical current measurement is then converted into the corresponding DNA sequence using a deep neural network trained on a collection of ionic current profiles of known DNA sequences [13]. Notably, both platforms enable long DNA molecules of more than 10 kilobase pairs to be routinely sequenced and are thus highly suited for the study of long repetitive elements like telomeres.

Results and discussion
In our analysis of telomeric regions with nanopore long-read sequencing in the recently sequenced and assembled CHM13 sample [14,15], we surprisingly observed that telomeric regions were frequently miscalled as other types of repeats in a strand-specific manner. Specifically, although human telomeres are typically represented by (TTAGGG)n repeats (Additional file 1: Fig. S1a), these regions were frequently recorded as (TTAAAA)n repeats (Fig. 1a, b, Additional file 1: Fig. S1 and S2a). At the same time, when examining the reverse complementary strand of the telomeres, which is represented as (CCCTAA)n repeats, we instead observed frequent substitution of these regions by (CTTCTT)n and (CCCTGG)n repeats (Fig. 1a, b, Additional file 1: Fig. S1 and S2b,c). Notably, these artefacts were not observed in the CHM13 reference genome [14,15] or in PacBio HiFi reads from the same site (Fig. 1a, b), suggesting that these observed repeats are artefacts of nanopore sequencing or the basecalling process, rather than real biological variations of telomeres. Examination of each telomeric long-read also indicates that these error repeats frequently co-occur with telomeric repeats at the ends of each read (Fig. 1c, Additional file 1: Fig. S3), and are observed on all chromosomal arms of CHM13 (Additional file 1: Fig. S1b,c, Additional file 1: Fig. S4). Together, our results suggest that telomeric regions are frequently misrepresented as other types of repeats in a strand-specific manner during nanopore sequencing.
As human sub-telomeres are known to have a high degree of similarity to each other [16], which may lead to mis-mapping of reads between different chromosomal arms, we explored the level of read mis-mapping between different arms to assess if this might affect our analysis. We simulated long-reads (mean = 10 kb) from the terminal 10 kb, 100 kb, and 1000 kb regions of the CHM13 reference genome (Methods) and remapped them to the CHM13 assembly to measure the rate of misalignment. Remarkably, under a mapping quality threshold of ≥1, the mapping error rate was only ~0.03-0.3% for reads ranging in base accuracy between 95 and 99.9% at each of these regions (Additional file 1: Fig. S5). Even when a less stringent mapping quality cutoff value of 0 was applied, a relatively low mapping error rate of 0.3-1.2% was observed (Additional file 1: Fig. S5). As such, our results from read simulation suggest that there is a minimal level of read mis-mapping between different chromosomal arms in the CHM13 sample. We next assessed the sequencing coverage of each chromosomal arm in the CHM13 sample to establish if there may be biases in read coverage caused by read mis-mapping. We did not see strong biases in the coverage of nanopore reads on each chromosomal arm in the CHM13 sample (Additional file 1: Fig. S6), in line with the low mapping error rate in our simulation study.
Fig. 1 Strand-specific nanopore basecalling errors are pervasive at telomeres. a, b IGV screenshot illustrating the three types of basecalling errors found on the forward and reverse strands of telomeres for nanopore sequencing. (TTAGGG)n on the forward strand of nanopore sequencing data was basecalled as (TTAAAA)n, while (CCCTAA)n on the reverse strand was basecalled as (CTTCTT)n and (CCCTGG)n. PacBio HiFi data generated from the same cell line (CHM13) is depicted as a control. The reference genome indicated in the plot corresponds to the CHM13 draft genome assembly (v1.0). c Co-occurrence heatmap illustrating the frequency of co-occurrence of repeats corresponding to natural telomeres, or to basecalling errors, in PacBio HiFi and nanopore long-reads found at chromosomal ends (within 10 kb of the annotated end of the reference genome). The diagonal of the co-occurrence matrix represents counts of long-reads with only a single type of repeat observed. d Basecalling errors at telomeres are observed across different nanopore datasets and sequencing platforms. e Basecalling errors at telomeres are observed for different nanopore basecallers and basecalling models. The Guppy5 and Bonito basecallers, and different basecalling models for each basecaller, were used to basecall telomeric reads in the CHM13 PromethION dataset (reads that mapped to the flanking 10 kb regions of the CHM13 reference genome). f Basecalling errors share similar nanopore current profiles with telomeric repeats. Current profiles for telomeric and basecalling error repeats were plotted based on known mean current profiles for each k-mer ("Methods"). g Summary of organisms assessed and the types of repeat errors observed. Note that S. pombe and D. melanogaster could not be readily assessed for the presence of error repeats by visualization in IGV as these sequences are more complex.
To evaluate if these errors are broadly observed in other studies or are specific to the CHM13 dataset from the Telomere-to-Telomere consortium, we examined the previously published NA12878 and HG002 nanopore genome sequencing datasets [12,13,17,18]. We observed the same basecalling errors, TTAGGG→TTAAAA, CCCTAA→CTTCTT, and CCCTAA→CCCTGG, at telomeres in these datasets (Fig. 1d, Additional file 1: Fig. S7a). Remarkably, between 40 and 60% of reads at telomeric regions in these three datasets display at least one of these types of basecalling repeat artefacts for the nanopore sequencing platform (Additional file 1: Fig. S7b), while these errors were not observed in the PacBio HiFi datasets for the same samples (Additional file 1: Fig. S7b). We also partitioned these datasets based on the sequencing platforms used to generate them and noted that basecalling error repeats are observed across all three nanopore sequencing platforms (MinION, GridION, PromethION) (Fig. 1d, Additional file 1: Fig. S7a). These error repeats are therefore a pervasive problem across nanopore sequencing datasets and sequencing platforms.
We then questioned if these error repeats are unique to specific nanopore basecallers or basecalling models. We extracted reads from chromosomal ends and re-basecalled the ionic current data of these reads using different basecallers and basecalling models. Using the production-ready basecaller Guppy5 (Oxford Nanopore Technologies) and the developmental-phase basecaller Bonito (Oxford Nanopore Technologies), we noticed that these basecalling error repeats can be readily observed across both basecallers (Fig. 1e, Additional file 1: Fig. S8 and S9). Further, these error repeats were also observed when different basecalling models were applied (Fig. 1e). Significantly, we also observed that the "fast" basecalling mode in Guppy led to almost complete loss of the (CCCTAA)n strand (Fig. 1e, Additional file 1: Fig. S8a), while the "HAC" basecalling model enabled both strands to be recovered, highlighting that the basecalling model applied can affect the strand-specific recovery of telomeric reads. Together, these results suggest that error repeats are observable across nanopore basecallers and basecalling models.
We asked if there might be a difference in current profiles between the error-prone and less error-prone reads. To distinguish the error-prone reads from the less error-prone reads, we calculated the number of telomeric repeats ((TTAGGG)3 and (CCCTAA)3) and artefact repeats ((TTAAAA)3, (CCCTGG)3, and (CTTCTT)3) on each long-read (Additional file 1: Fig. S10a-b). The proportion of repeat-calling errors on each read can then be established by dividing the number of artefact repeats by the total number of telomeric and artefact repeats (Additional file 1: Fig. S10c-d). While the majority of long-reads (69.5%) on the "CCCTAA" strand had an error proportion of >90%, only 5.2% of long-reads on the "TTAGGG" strand had an error proportion of >90%, suggesting that the repeat-calling errors occur more frequently on the "CCCTAA" strand than on the "TTAGGG" strand. We then examined the current profiles of the more error-prone reads (i.e., reads with a higher proportion of repeat-calling errors) and the less error-prone reads. We were not able to observe an obvious visual difference in current profiles between the reads with a higher proportion (>0.9) of repeat-calling errors (Additional file 1: Fig. S11a-c) and the reads with a lower proportion (<0.4) of repeat-calling errors (Additional file 1: Fig. S11d-f).
To determine the cause of these repeat-calling errors, we examined the ionic current profiles of true telomeric repeats and artefactual error repeats. We extracted known mean current values of each 6-mer and its six circular permutations (e.g., TTAGGG, TAGGGT, and AGGGTT) and generated their ionic current profiles (Methods). Remarkably, we observed a high degree of similarity in current profiles between telomeric repeats and these basecalling errors (Fig. 1f). Specifically, we observed that (TTAGGG)n telomeric repeats had a high degree of similarity with the (TTAAAA)n error repeats generated by the Bonito basecaller (Pearson correlation = 0.9928, Euclidean distance = 4.9934) (Additional file 1: Fig. S12a-c). Similarly, the (CCCTAA)n current profile also showed high similarity with (CCCTGG)n repeats (Pearson correlation = 0.9783, Euclidean distance = 4.687), and reasonably good similarity with (CTTCTT)n repeats (Pearson correlation = 0.6411, Euclidean distance = 19.384) (Additional file 1: Fig. S12a-c). Together, these results suggest that similarities in current profiles between repeat sequences are possible causes of repeat-calling errors at telomeric repeats.
We then examined if repeat-calling errors may extend to other repetitive sequences beyond telomeric sequences. To address this, we searched for other repeat pairs with similar current profiles that may be susceptible to these repeat-calling errors. We simulated and performed pairwise comparisons of current profiles for all 6-mer repeats (n = 8,386,560 comparisons) (Methods). Using Pearson correlation (≥0.99) and Euclidean distance (≤5) cutoffs similar to the values observed for the telomeric repeat errors identified in this study (Additional file 1: Fig. S12a-c), we identified a further 2577 pairs of repeats with similar current profiles (Additional file 2: Table S1, Additional file 1: Fig. S12d). For instance, we found that (TTAGGG)n telomeric repeats also showed high similarities in current profiles with repeats with single-nucleotide substitutions like (TTAAGG)n, (TTAGAG)n, and (TTGGGG)n (Additional file 1: Fig. S12d,e). Repeat sequences like (GCTGCT)n and (AACGGC)n that differed drastically at the sequence level but shared similar current profiles were also observed (Additional file 1: Fig. S12d,f). Further, we also examined the unmappable pool of CHM13 nanopore reads after mapping it to the CHM13 reference assembly. Remarkably, a significant pool of reads with long (GT)n repeats was readily observed (Additional file 1: Fig. S13). Interestingly, (GTGTGT)n repeats were also found to have high similarities in current profiles with (CTCTCT)n repeats (Additional file 1: Fig. S12d, Additional file 2: Table S1), suggesting that the pool of unmappable (GT)n reads may include (CT)n repeats. Collectively, our results suggest that these basecalling error repeats may be observed at other repetitive regions beyond telomeres.
It is interesting to note that telomere-like sequences are also frequently found near telomeric regions [19-22]. Specifically, there are three main types of telomere-like repeat sequences that are frequently found near telomeres in the human genome, namely the c-type repeats (TCAGGG)n, g-type repeats (TGAGGG)n, and j-type repeats (TTGGGG)n [23]. We asked if these telomere-like repeat sequences might also be basecalled incorrectly, similar to what we have observed at telomeres with (TTAGGG)n repeat sequences. We therefore identified these telomere-like repeat regions from the CHM13 reference genome (Methods) and visually inspected them in IGV. These telomere-like repeat sequences could also be miscalled as repeat sequences with a different monomer length. For instance, we observed that the 6-mer (CCCTCA)n repeats could be miscalled as the 5-mer (CCTCA)n repeat sequence (Additional file 1: Fig. S14a). (CCCTGA)n and (TCAGGG)n 6-mer repeats could also be miscalled as (CCTGA)n 5-mer repeats and (TCAGGGG)n 7-mer repeats, respectively (Additional file 1: Fig. S14b). Further, the 6-mer (TTGGGG)n repeat was observed to be miscalled as the 7-mer (TTGGGGG)n repeat (Additional file 1: Fig. S14c). We explored the current profiles for these repeats (10 consecutive repeats) using known current values for each 6-mer (Additional file 1: Fig. S15). Remarkably, even though these repeat sequences were of different lengths, we see that these sequences can still share a highly similar current profile (Additional file 1: Fig. S15a,b,d,e,g). Of note, other 6-mer repeats were also predicted to have similar current profiles to these three types of telomere-like repeat sequences (Additional file 1: Fig. S16). Together, these results suggest that repeat miscalling errors can also be observed on these telomere-like repeat sequences. More broadly, our results also show that repeat sequences of different lengths (i.e., 6-mers vs. 5-mers and 6-mers vs. 7-mers) can share similar current profiles and be miscalled as each other.
To see if these repeat-calling errors might extend to the telomeres of other organisms, we obtained nanopore genome sequencing datasets corresponding to eight model organisms covering a wide spectrum of the tree of life from the NCBI SRA database (Fig. 1g, Additional file 2: Table S2 and S3) [24]. These eight organisms are Arabidopsis thaliana [25,26], Caenorhabditis elegans [27], Gallus gallus (chicken), Drosophila melanogaster [28], Mus musculus (mouse) [29,30], Saccharomyces cerevisiae [31], Schizosaccharomyces pombe, and Danio rerio (zebrafish) [32,33], which are all widely studied and have high-quality reference genomes. At the telomeres, these organisms are known to have (TTAGGG)n telomeric repeats as humans do (chicken, zebrafish, mouse) [34-36], (TTTAGGG)n repeats (A. thaliana) [37], (TG1-3)n repeats (S. cerevisiae) [38], TTAC(A)(C)G2-8 repeats (S. pombe) [39], (TTAGGC)n repeats (C. elegans) [40], or retrotransposons (D. melanogaster) [41] (Fig. 1g). As raw current data was not available for all datasets, we directly utilized the sequence data published by the authors of these studies. As expected, we also observed repeat-calling errors on telomeres in organisms with (TTAGGG)n-type repeats (Additional file 1: Fig. S17a, S18), akin to what we observed in humans. Interestingly, we also observed telomeric repeat errors similar to those in humans in A. thaliana, which is known to have 7-mer (TTTAGGG)n repeats (note that humans have a slightly different repeat sequence of TTAGGG) (Additional file 1: Fig. S17b, S19), which suggests that these repeats need not be 6-mer repeats (the approximate number of nucleotides detected by the nanopore at each time) for errors to be observed. In C. elegans, (CTTGGG)n repeat errors instead of (TTAGGC)n telomeric repeats could also be detected in one of the two datasets assessed (Additional file 1: Fig. S17b, S20). We did not observe repeat errors for S. cerevisiae, which is known to have (TG1-3)n repeats at its telomeres (Additional file 1: Fig. S17c, S21), suggesting that these repeat errors do not occur on the telomeres of all organisms. In organisms like S. pombe with more complex telomeric repeat sequences, some strand bias could be observed, though we were unable to observe specific error motifs (Additional file 1: Fig. S17d). For D. melanogaster, which elongates telomeres via a retrotransposition-based mechanism, it was not possible to assess the frequency of repeats. Nonetheless, there was no observable difference in basecalling between the two strands at the ends of the D. melanogaster reference genome (Additional file 1: Fig. S22). Together, our results suggest that repeat-calling errors in nanopore sequencing can be observed at the telomeres of some other organisms beyond humans.
To resolve these basecalling errors at telomeres, we attempted to tune the nanopore basecaller by providing it with more training examples of telomeres (Fig. 2a). Notably, model training was performed with a low learning rate so that the majority of the model is left unaffected during training, while minor adjustments can still be made to accurately basecall telomeres. Specifically, we tuned the deep neural network model underlying the Bonito basecaller by training it at a low learning rate with ground truth telomeric sequences extracted from the CHM13 reference genome and the current data of the corresponding reads (Methods). As two nanopore PromethION runs were performed on the CHM13 dataset, we used the data from one run (run225) for training and tuning of the basecaller and held out the data from the second run (run226) for evaluation of our tuned basecaller. With this approach, we see a significant improvement in the basecalls of both the telomeric and sub-telomeric regions on the training data and the held-out dataset, with a clearly observable decrease in errors at the chromosomal ends (Fig. 2b, Additional file 1: Fig. S23a-d).
As it is computationally more efficient to redo basecalling only for the small fraction of problematic telomeric reads rather than for all reads, we developed an overall strategy to select these telomeric reads for re-basecalling with the tuned Bonito+telomeres basecaller (Fig. 2c). To select telomeric reads for selective re-basecalling, we relied on an observation from the CHM13 reference genome and nanopore sequencing datasets. Specifically, we noticed that telomeric reads that mapped to the ends of the CHM13 reference genome tend to show a high frequency of telomeric or basecalling error repeats as compared to the rest of the genome (Additional file 1: Fig. S24). We therefore utilized this observation to separate the non-telomeric reads from the candidate telomeric reads (Fig. 2c, "Methods"). These telomeric reads were then re-basecalled with the tuned Bonito basecaller before being recombined with the pool of non-telomeric reads. Remarkably, with this strategy, we observed a significant improvement in the recovery of telomeric reads with (TTAGGG)n and (CCCTAA)n repeats (from 384 to 476 TTAGGG reads and from 373 to 686 CCCTAA reads) (Fig. 2d). At the same time, a sharp reduction of these basecalling repeat errors was also observed (151 to 17 TTAAAA reads, 561 to 48 CTTCTT reads, and 337 to 20 CCCTGG reads) (Fig. 2d). Our "selective tuning" approach for fixing basecalling errors at telomeres can thus improve the recovery of telomeric reads while reducing telomeric basecalling repeat artefacts.
We further evaluated our approach for possible impact on overall basecalling accuracy. While a reduction in global basecalling accuracy was observed (~1-2%) when our tuned basecaller was directly applied to the full dataset, likely caused by miscalling of endogenous (CTTCTT)n genomic repeats as (CCCTAA)n, this loss of global basecalling accuracy could be avoided by applying our basecaller to telomeric reads alone. Concordant with this, we did not observe changes in overall basecalling accuracy with our telomere-selective tuning approach (Fig. 2e). These results indicate that our telomere-selective tuning approach has a negligible impact on basecalling accuracy for the rest of the genome.
Fig. 2 Selective re-basecalling of telomeric reads resolves basecalling errors at telomeres. a Approach for tuning the Bonito basecalling model for improving basecalls at telomeres. b The tuned Bonito basecalling model leads to improvement in basecalls at telomeric regions. IGV screenshots of the telomeric region (chr2q) in the CHM13 dataset basecalled using the default Bonito basecaller and the tuned Bonito basecalling model are depicted. c Overall approach for selecting and fixing telomeric reads in nanopore sequencing datasets. Telomeric reads are selected ("Methods") and re-basecalled using the tuned Bonito basecalling model. d The selective tuning approach leads to improved recovery of telomeric reads and a decrease in the number of reads with basecalling artefacts. Evaluation was performed on the held-out test dataset (run226). e The "selective tuning" approach leads to little detected negative impact on basecalling of other genomic regions. The sequence similarity of all reads to the reference genome was evaluated for three approaches for basecalling of nanopore reads: applying the default Bonito basecalling model to all reads (untuned Bonito model), applying the tuned Bonito basecalling model to all reads (tuned Bonito model), and applying the tuned Bonito basecalling model selectively to telomeric reads only (selective tuning of telomeric reads). The density plot depicts the sequence similarity of each read against the CHM13 reference genome as assessed using minimap2.

Conclusion
In this study, we showed that basecalling errors can be widely observed at telomeric regions across nanopore datasets, sequencing platforms, basecallers, and basecalling models. These repeat errors further extend to telomeres of other organisms with (TTAGGG)n repeats, to organisms with non-(TTAGGG)n repeats, and also to repeats with different monomer lengths. We further showed that these strand-specific basecalling errors were likely induced by similarities in current profiles between different repeat types. To resolve these basecalling errors at telomeres, we devised an overall strategy to re-basecall telomeric reads using a tuned nanopore basecaller. More broadly, our study highlights the importance of verifying nanopore basecalls in long, repetitive, and poorly defined regions of the genome. For instance, this can be done either with an orthogonal platform or, at a minimum, by ensuring that nanopore basecalls between opposite strands are concordant. An extensive evaluation of genome-wide basecalling errors in repeat regions is also needed in the future given our observations at telomeric regions. Nonetheless, we anticipate that further improvements in the nanopore basecaller or basecalling model, as demonstrated in this study, will potentially lead to the reduction or elimination of these basecalling artefacts.

Extraction of candidate telomeric reads
Telomeric reads were extracted by mapping all reads to the CHM13 draft genome assembly (v1.0) obtained from the Telomere-to-Telomere consortium using Minimap2 (version 2.17-r941) [43]. Reads that mapped to within 10 kilobase pairs of the start or end of each autosome and the X chromosome were then extracted using SAMtools (version 1.10) [44].
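The positional filter described above can be illustrated with a minimal sketch. It assumes that alignments have already been parsed into plain tuples with 0-based, half-open coordinates and that "within 10 kilobase pairs" means overlapping the terminal window; the function names and the tuple layout are hypothetical and do not reproduce the SAMtools commands used in this study.

```python
def near_chromosome_end(ref_start, ref_end, chrom_length, window=10_000):
    """Return True if the alignment [ref_start, ref_end) overlaps the first
    or last `window` bases of the chromosome (assumed interpretation)."""
    overlaps_start = ref_start < window
    overlaps_end = ref_end > chrom_length - window
    return overlaps_start or overlaps_end


def select_terminal_reads(alignments, chrom_lengths, window=10_000):
    """Collect the names of reads aligned near a chromosome end.

    alignments: iterable of (read_name, chrom, ref_start, ref_end) tuples
    chrom_lengths: dict mapping chromosome name to its length in bases
    """
    selected = set()
    for name, chrom, start, end in alignments:
        if near_chromosome_end(start, end, chrom_lengths[chrom], window):
            selected.add(name)
    return selected


if __name__ == "__main__":
    # Toy chromosome length and alignments, for illustration only.
    lengths = {"chr1": 1_000_000}
    toy = [
        ("read_a", "chr1", 5_000, 20_000),     # overlaps the first 10 kb
        ("read_b", "chr1", 500_000, 520_000),  # interior of the chromosome
        ("read_c", "chr1", 985_000, 995_000),  # overlaps the last 10 kb
    ]
    print(sorted(select_terminal_reads(toy, lengths)))
```

In practice, the same selection can be made directly on a sorted and indexed alignment file by querying the terminal regions of each chromosome.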

Co-occurrence matrix
Candidate PacBio HiFi and nanopore telomeric reads were first extracted as described above and then converted into the FASTA format using SAMtools (version 1.10) [44]. Custom Python scripts were then used to assess if each read contained at least four consecutive copies of the repeat sequence of interest (e.g., (TTAGGG)4). This information was then used to generate the pairwise co-occurrence matrix depicted in the main text, which was plotted with R.
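A minimal sketch of how such a co-occurrence table could be tallied is given below, assuming that reads are available as plain strings. The function names and the dictionary layout are illustrative and are not the custom scripts used in this study; following the legend of Fig. 1c, a read showing only a single repeat type is counted on the diagonal.

```python
from itertools import combinations, combinations_with_replacement

# Telomeric repeats and the three artefact repeats described in the text.
MOTIFS = ["TTAGGG", "CCCTAA", "TTAAAA", "CTTCTT", "CCCTGG"]


def motifs_present(seq, motifs=MOTIFS, min_copies=4):
    """List motifs that occur as at least `min_copies` consecutive copies."""
    return [m for m in motifs if m * min_copies in seq]


def cooccurrence(reads, motifs=MOTIFS, min_copies=4):
    """Count reads per motif pair; the diagonal holds single-motif reads."""
    counts = {pair: 0 for pair in combinations_with_replacement(motifs, 2)}
    for seq in reads:
        present = motifs_present(seq, motifs, min_copies)
        if len(present) == 1:
            counts[(present[0], present[0])] += 1
        else:
            # `present` keeps the order of `motifs`, so keys always match.
            for pair in combinations(present, 2):
                counts[pair] += 1
    return counts


if __name__ == "__main__":
    toy_reads = [
        "ACGT" + "TTAGGG" * 5,            # telomeric repeats only
        "TTAGGG" * 4 + "TTAAAA" * 4,      # telomeric plus artefact repeats
        "ACGTACGT",                       # neither
    ]
    table = cooccurrence(toy_reads)
    for pair, n in table.items():
        if n:
            print(pair, n)
```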

Current profiles for different repeat sequences
The mean current level for different k-mers sequenced by nanopore sequencing was obtained from the k-mer models published by Oxford Nanopore (https://github.com/nanoporetech/kmer_models/tree/master/r9.4_180mv_450bps_6mer). Circular permutations of each 6-mer of interest were generated, and their corresponding mean current levels were extracted from the k-mer models. The current profiles for each of the indicated repeat sequences were then plotted and depicted in the figure.
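As an illustration of this procedure, the sketch below generates the circular permutations of a repeat monomer and reads off an expected current profile along a tandem repeat. The `kmer_means` dictionary (6-mer to mean current level) is a hypothetical stand-in for the published k-mer model table, and the toy values used in the example are not real current levels.

```python
def circular_permutations(monomer):
    """Return all rotations of a repeat monomer, starting with itself."""
    return [monomer[i:] + monomer[:i] for i in range(len(monomer))]


def repeat_current_profile(monomer, kmer_means, k=6, n_repeats=10):
    """Expected mean current for each k-mer window along a tandem repeat.

    kmer_means: dict mapping each k-mer to its mean current level
                (assumed to have been loaded from a k-mer model table).
    """
    seq = monomer * n_repeats
    return [kmer_means[seq[i:i + k]] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    rotations = circular_permutations("TTAGGG")
    print(rotations)
    # Toy current levels (NOT real values): one arbitrary number per rotation.
    toy_means = {kmer: float(i) for i, kmer in enumerate(rotations)}
    print(repeat_current_profile("TTAGGG", toy_means, n_repeats=2))
```

For a 6-mer monomer, the windows along the tandem repeat cycle through its six rotations, so the profile is periodic with a period of six positions.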

Pairwise comparison of all possible k-mers
The current profile for each 6-mer repeat sequence was generated using the published k-mer models as described above. Pairwise comparisons of all possible 6-mer repeat current profiles were then performed (8,386,560 pairs in total). For each pair of 6-mer repeat current profiles, (i) the Pearson correlation value, (ii) the mean-centered Euclidean distance, and (iii) the mean current difference were then calculated. Pairs of repeats with a Pearson correlation value ≥ 0.99 and a Euclidean distance ≤ 5 were selected as putative repeat pairs that can be miscalled.
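The three similarity measures and the selection rule can be sketched as follows, assuming that each current profile is a list of equal length. The function names are illustrative, and treating a flat profile (zero variance) as having an undefined correlation is an assumption of this sketch. The number of unordered pairs among the 4^6 = 4096 possible 6-mers is 4096 × 4095 / 2 = 8,386,560, matching the total stated above.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN when either profile is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def centered_euclidean(x, y):
    """Euclidean distance after subtracting each profile's own mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return math.sqrt(sum(((a - mx) - (b - my)) ** 2 for a, b in zip(x, y)))


def mean_difference(x, y):
    """Difference between the mean current levels of the two profiles."""
    return sum(x) / len(x) - sum(y) / len(y)


def is_putative_miscall_pair(x, y, min_r=0.99, max_dist=5.0):
    """Apply the cutoffs stated in the text (r >= 0.99 and distance <= 5)."""
    return pearson(x, y) >= min_r and centered_euclidean(x, y) <= max_dist


if __name__ == "__main__":
    # Number of unordered pairs among all possible 6-mers.
    print(math.comb(4 ** 6, 2))
    # Toy profiles (NOT real current levels) that differ only by an offset.
    a = [1.0, 2.0, 3.0, 4.0]
    b = [11.0, 12.0, 13.0, 14.0]
    print(pearson(a, b), centered_euclidean(a, b), mean_difference(a, b))
```

Mean-centering makes the distance insensitive to a constant offset between two profiles, which is why the mean current difference is reported as a separate quantity.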

Tuning of bonito model
The default model from Bonito v0.3.5 (commit d8ae5eeb834d4fa05b441dc8f034ee04cb704c69) was used as the base model for model tuning. The training dataset needed for the training process was generated from the telomeric reads from a PromethION run in the CHM13 dataset (run225). Broadly, we generated the training dataset by matching the current profiles from the nanopore run to ground truth sequences that we extracted from the CHM13 draft reference genome assembly (v1.0) using custom-written code. Specifically, these telomeric reads were first basecalled using the initial Bonito basecalling model and then mapped back to the CHM13 draft reference genome assembly (v1.0). This allowed each telomeric read to be properly assigned to its corresponding chromosomal arm via its sub-telomeric sequence. Nonetheless, as the telomeric region of the same read could not be properly mapped to the telomeric repeats due to the repeat errors, there was difficulty in assigning the nanopore current data to the correct ground truth sequences in the reference genome. As such, the presumed length of the sequence to extract was estimated using the basecalling repeat error sequences, and a sequence of the same length was then extracted from the CHM13 reference genome to serve as the ground truth sequence. With this idea and with a custom Perl script, we were able to generate a set of ground truth sequences and signals for model tuning. These data were then formatted into the corresponding Python objects required by the Bonito basecaller with custom Python scripts. Using the tune function in Bonito and our prepared training dataset, we were then able to train the basecaller to convergence.

Selective application of tuned basecaller to telomeric reads
We applied our tuned basecaller by first extracting candidate telomeric reads for re-basecalling. This was done by enumerating the total count of telomeric repeats (i.e., (TTAGGG)3 and (CCCTAA)3) and artefact repeats (i.e., (TTAAAA)3, (CTTCTT)3, and (CCCTGG)3) on each read. Reads with at least 10 total counts of these repeats were isolated and their readnames noted. These reads were then excluded from the total pool of reads via their readnames and basecalled separately with our tuned basecaller using the fast5 data of these reads. Following basecalling with the tuned basecaller, these reads were then recombined with the main pool of reads.
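The read selection step can be illustrated with a short sketch, assuming that reads are held in memory as a mapping from readname to sequence and that repeats are counted without overlaps. The names used here are hypothetical, and the re-basecalling itself is outside the scope of the sketch.

```python
TELOMERIC = ("TTAGGG", "CCCTAA")
ARTEFACT = ("TTAAAA", "CTTCTT", "CCCTGG")


def total_repeat_count(seq, copies=3):
    """Total non-overlapping count of (monomer)x`copies` over all motifs."""
    # str.count counts non-overlapping occurrences of the substring.
    return sum(seq.count(m * copies) for m in TELOMERIC + ARTEFACT)


def split_reads(reads, min_total=10):
    """Split readnames into candidate telomeric reads and all other reads.

    reads: dict mapping readname to sequence
    Returns (candidate_names, other_names) as two sets.
    """
    candidates = {
        name for name, seq in reads.items()
        if total_repeat_count(seq) >= min_total
    }
    return candidates, set(reads) - candidates


if __name__ == "__main__":
    toy = {
        "telomeric_read": "TTAGGG" * 30,   # ten (TTAGGG)3 units
        "other_read": "ACGT" * 50,         # no repeats of interest
    }
    candidates, others = split_reads(toy)
    print(sorted(candidates), sorted(others))
```

The candidate readnames would then be used to pull the corresponding raw signal files for re-basecalling, after which the new basecalls replace the original entries for those reads.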
The corresponding fastq files for each of these runs were then downloaded from the SRA database and mapped to their corresponding reference genomes using minimap2 with the parameter -x map-ont. The reference genomes used for read mapping of each of the organisms are as follows: A. thaliana (TAIR10), C. elegans (ce11), Chicken (galGal6), D. melanogaster (dm6), Mouse (mm39), S. cerevisiae (sacCer3), S. pombe (https://www.pombase.org/data/genome_sequence_and_features/genome_sequence/Schizosaccharomyces_pombe_all_chromosomes.fa.gz), and Zebrafish (danRer11). Alignments of these nanopore datasets for each of these organisms were then visualized together with their corresponding reference genomes in IGV at the annotated terminal ends. Note that as not all chromosomal ends were well assembled in these organisms, only selected chromosomal arms could be readily visualized and inspected in IGV for the presence of these repeat-calling errors. To generate plots summarizing the frequency of telomeric repeats and repeat errors in each organism, reads on the terminal 10 kb region of each chromosomal arm were extracted. The only exception was mouse, for which reads on the terminal 500 kb region of the reference genome were extracted, as the ends of the reference genome were padded by very long stretches of Ns.

Sequencing coverage of each chromosomal arm
Reads within 10 kb, 100 kb, and 1000 kb of the annotated end of each chromosomal arm in the CHM13 reference were extracted using SAMtools [44] and then counted. A boxplot of the distribution of the number of reads observed on each arm was then generated using R.

Simulation of long-reads to assess mismapping rates at sub-telomeres in CHM13
PBSIM2 [45] was used to simulate long-reads from the CHM13 reference genome with the parameters --depth 100 --length-min 5000 --length-mean 10000 --accuracy-mean 0.95 --hmm_model R94.model. In some instances, we also modified the read accuracy from 0.95 to 0.98 or 0.999 to assess the impact of read accuracy on the mismapping rate. The pbsim2fq function in the paftools.js script (distributed as part of minimap2) [43] was then used to generate fastq files with readnames corresponding to the true read positions from the .maf files from PBSIM2. Reads that originated from the terminal 1000 kb, 100 kb, or 10 kb region of the CHM13 reference genome (i.e., overlapping these regions by at least one base pair) were then extracted and mapped to the CHM13 reference genome using minimap2 (version 2.17-r941) [43]. The mapeval function in the paftools.js script was then used to evaluate the mapping accuracy of the reads extracted from these regions.

Evaluation of errors at telomere-like repeat regions
To evaluate the presence of repeat-calling errors at telomere-like regions, we first identified regions in the CHM13 reference genome with telomere-like repeat sequences. This was done by mapping the CHM13 reference to an artificial reference containing 600 repeats of each of the three types of telomere-like repeats. In all, we identified 7 regions with (TCAGGG)n, 7 regions with (TGAGGG)n, and 5 regions with (TTGGGG)n repeats that are at least 100 bp in length. Each of these regions was then manually inspected in IGV for the occurrence of these repeat errors.

Evaluation of more error-prone and less error-prone reads
To establish which reads are more error-prone or less error-prone, we calculated the number of non-overlapping telomeric repeats ((TTAGGG)3 and (CCCTAA)3) and artefact repeats ((TTAAAA)3, (CCCTGG)3, and (CTTCTT)3) on each long-read using custom Python scripts. The proportion of repeat errors on each long-read was then calculated by dividing the number of artefact repeats on each long-read by the total number of telomeric and artefact repeats.
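A minimal sketch of this calculation is shown below. It is not the custom script used in this study; returning `None` for reads without any telomeric or artefact repeats is an assumption made here to avoid a division by zero.

```python
TELOMERIC_UNITS = ("TTAGGG" * 3, "CCCTAA" * 3)
ARTEFACT_UNITS = ("TTAAAA" * 3, "CCCTGG" * 3, "CTTCTT" * 3)


def error_proportion(seq):
    """Artefact repeats divided by telomeric plus artefact repeats.

    Counts are non-overlapping (str.count semantics). Returns None when
    the read carries none of the repeats (assumed convention).
    """
    telomeric = sum(seq.count(unit) for unit in TELOMERIC_UNITS)
    artefact = sum(seq.count(unit) for unit in ARTEFACT_UNITS)
    total = telomeric + artefact
    if total == 0:
        return None
    return artefact / total


if __name__ == "__main__":
    # Toy read: one (CCCTAA)3 unit followed by three (CTTCTT)3 units.
    toy_read = "CCCTAA" * 3 + "ACGT" + "CTTCTT" * 9
    print(error_proportion(toy_read))
```

Reads can then be grouped by this proportion, for example into the >0.9 and <0.4 groups whose current profiles are compared in the Results.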